In This Phase I Want To :
discover of smart device data to gain insight of what consumers prefer.
Improve marketing strategy for the bellabeat company.
The available information is related with 3 time variables,which is day, the minute, and the second To simplify the analysis, I will choose the files related to the day because they store information for a longer time and therefore more accuracy
| Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 4/12/2016 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
| 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
| 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
| 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
| 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0 | 2.71 | 0.41 | 5.04 | 0 | 36 | 10 | 221 | 773 | 1863 |
| Id | ActivityDay | Calories |
|---|---|---|
| 1503960366 | 4/12/2016 | 1985 |
| 1503960366 | 4/13/2016 | 1797 |
| 1503960366 | 4/14/2016 | 1776 |
| 1503960366 | 4/15/2016 | 1745 |
| 1503960366 | 4/16/2016 | 1863 |
| Id | ActivityDay | SedentaryMinutes | LightlyActiveMinutes | FairlyActiveMinutes | VeryActiveMinutes | SedentaryActiveDistance | LightActiveDistance | ModeratelyActiveDistance | VeryActiveDistance |
|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 4/12/2016 | 728 | 328 | 13 | 25 | 0 | 6.06 | 0.55 | 1.88 |
| 1503960366 | 4/13/2016 | 776 | 217 | 19 | 21 | 0 | 4.71 | 0.69 | 1.57 |
| 1503960366 | 4/14/2016 | 1218 | 181 | 11 | 30 | 0 | 3.91 | 0.40 | 2.44 |
| 1503960366 | 4/15/2016 | 726 | 209 | 34 | 29 | 0 | 2.83 | 1.26 | 2.14 |
| 1503960366 | 4/16/2016 | 773 | 221 | 10 | 36 | 0 | 5.04 | 0.41 | 2.71 |
| Id | ActivityDay | StepTotal |
|---|---|---|
| 1503960366 | 4/12/2016 | 13162 |
| 1503960366 | 4/13/2016 | 10735 |
| 1503960366 | 4/14/2016 | 10460 |
| 1503960366 | 4/15/2016 | 9762 |
| 1503960366 | 4/16/2016 | 12669 |
| Id | SleepDay | TotalSleepRecords | TotalMinutesAsleep | TotalTimeInBed |
|---|---|---|---|---|
| 1503960366 | 4/12/2016 12:00:00 AM | 1 | 327 | 346 |
| 1503960366 | 4/13/2016 12:00:00 AM | 2 | 384 | 407 |
| 1503960366 | 4/15/2016 12:00:00 AM | 1 | 412 | 442 |
| 1503960366 | 4/16/2016 12:00:00 AM | 2 | 340 | 367 |
| 1503960366 | 4/17/2016 12:00:00 AM | 1 | 700 | 712 |
| Id | Date | WeightKg | WeightPounds | Fat | BMI | IsManualReport | LogId |
|---|---|---|---|---|---|---|---|
| 1503960366 | 5/2/2016 11:59:59 PM | 52.6 | 115.9631 | 22 | 22.65 | True | 1.462234e+12 |
| 1503960366 | 5/3/2016 11:59:59 PM | 52.6 | 115.9631 | NA | 22.65 | True | 1.462320e+12 |
| 1927972279 | 4/13/2016 1:08:52 AM | 133.5 | 294.3171 | NA | 47.54 | False | 1.460510e+12 |
| 2873212765 | 4/21/2016 11:59:59 PM | 56.7 | 125.0021 | NA | 21.45 | True | 1.461283e+12 |
| 2873212765 | 5/12/2016 11:59:59 PM | 57.3 | 126.3249 | NA | 21.69 | True | 1.463098e+12 |
We can see there some problems with this data needed to be fixed which are :
In this phase i will cleaning and processing my data to make sure is ready to analysis and make decision after that.
library(tidyverse)
library(kableExtra)
library(scales)
library(highcharter)
library(RColorBrewer)
First let’s take a look of data attributes
str(daily_activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(daily_intensities)
## 'data.frame': 940 obs. of 10 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
str(daily_steps)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ StepTotal : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
str(daily_sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight_info)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
As we can see there are Incorrect types of variables
daily_activity<-daily_activity%>%mutate(ActivityDate = as.Date(ActivityDate,format = "%m/%d/%y"))
daily_calories<-daily_calories%>%mutate(ActivityDay = as.Date(ActivityDay,format = "%m/%d/%y"))
daily_intensities<-daily_intensities%>%mutate(ActivityDay = as.Date(ActivityDay, format = "%m/%d/%y"))
daily_steps<-daily_steps%>% mutate(ActivityDay = as.Date(ActivityDay,format = "%m/%d/%y"))
daily_sleep<-daily_sleep%>%mutate(SleepDay = as.Date(SleepDay,format="%m/%d/%y"))
weight_info<-weight_info%>%mutate(Date = as.Date(Date,format="%m/%d/%y"))
options(scipen = 999)
daily_activity<-daily_activity %>% rename(date = ActivityDate)%>% clean_names()
daily_calories<-daily_calories %>% rename(date = ActivityDay) %>% clean_names()
daily_intensities<-daily_intensities %>% rename(date = ActivityDay)%>% clean_names()
weight_info<-weight_info %>% rename(date = Date)%>% clean_names()
daily_steps<-daily_steps %>% rename(date = ActivityDay,TotalSteps=StepTotal)%>% clean_names()
daily_sleep<-daily_sleep %>% rename(date = SleepDay)%>% clean_names()
Let see our data how’s look
| id | date | total_steps | total_distance | tracker_distance | logged_activities_distance | very_active_distance | moderately_active_distance | light_active_distance | sedentary_active_distance | very_active_minutes | fairly_active_minutes | lightly_active_minutes | sedentary_minutes | calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 2020-04-12 | 13162 | 8.50 | 8.50 | 0 | 1.88 | 0.55 | 6.06 | 0 | 25 | 13 | 328 | 728 | 1985 |
| 1503960366 | 2020-04-13 | 10735 | 6.97 | 6.97 | 0 | 1.57 | 0.69 | 4.71 | 0 | 21 | 19 | 217 | 776 | 1797 |
| 1503960366 | 2020-04-14 | 10460 | 6.74 | 6.74 | 0 | 2.44 | 0.40 | 3.91 | 0 | 30 | 11 | 181 | 1218 | 1776 |
| 1503960366 | 2020-04-15 | 9762 | 6.28 | 6.28 | 0 | 2.14 | 1.26 | 2.83 | 0 | 29 | 34 | 209 | 726 | 1745 |
| id | date | calories |
|---|---|---|
| 1503960366 | 2020-04-12 | 1985 |
| 1503960366 | 2020-04-13 | 1797 |
| 1503960366 | 2020-04-14 | 1776 |
| 1503960366 | 2020-04-15 | 1745 |
| id | date | sedentary_minutes | lightly_active_minutes | fairly_active_minutes | very_active_minutes | sedentary_active_distance | light_active_distance | moderately_active_distance | very_active_distance |
|---|---|---|---|---|---|---|---|---|---|
| 1503960366 | 2020-04-12 | 728 | 328 | 13 | 25 | 0 | 6.06 | 0.55 | 1.88 |
| 1503960366 | 2020-04-13 | 776 | 217 | 19 | 21 | 0 | 4.71 | 0.69 | 1.57 |
| 1503960366 | 2020-04-14 | 1218 | 181 | 11 | 30 | 0 | 3.91 | 0.40 | 2.44 |
| 1503960366 | 2020-04-15 | 726 | 209 | 34 | 29 | 0 | 2.83 | 1.26 | 2.14 |
| id | date | total_steps |
|---|---|---|
| 1503960366 | 2020-04-12 | 13162 |
| 1503960366 | 2020-04-13 | 10735 |
| 1503960366 | 2020-04-14 | 10460 |
| 1503960366 | 2020-04-15 | 9762 |
| id | date | total_sleep_records | total_minutes_asleep | total_time_in_bed |
|---|---|---|---|---|
| 1503960366 | 2020-04-12 | 1 | 327 | 346 |
| 1503960366 | 2020-04-13 | 2 | 384 | 407 |
| 1503960366 | 2020-04-15 | 1 | 412 | 442 |
| 1503960366 | 2020-04-16 | 2 | 340 | 367 |
| id | date | weight_kg | weight_pounds | fat | bmi | is_manual_report | log_id |
|---|---|---|---|---|---|---|---|
| 1503960366 | 2020-05-02 | 52.6 | 115.9631 | 22 | 22.65 | True | 1462233599000 |
| 1503960366 | 2020-05-03 | 52.6 | 115.9631 | NA | 22.65 | True | 1462319999000 |
| 1927972279 | 2020-04-13 | 133.5 | 294.3171 | NA | 47.54 | False | 1460509732000 |
| 2873212765 | 2020-04-21 | 56.7 | 125.0021 | NA | 21.45 | True | 1461283199000 |
as we can see the data almost cleaning but now we want to merge the data to complete my cleaning and analysis it later
I will going to merge five Dataframes
1-daily_activity
2-daily_calories
3-daily_intensities
4-daily_steps
5-daily_sleep
because :
fisrt_list <- list(daily_activity, daily_calories)
fisrt_list<-fisrt_list %>% reduce(full_join, by=c("id","date","calories"))
second_list <- list(fisrt_list, daily_intensities)
second_list<-second_list %>% reduce(full_join, by=c("id"
,"date","sedentary_minutes","lightly_active_minutes"
,"fairly_active_minutes","very_active_minutes"
,"sedentary_active_distance","light_active_distance"
,"moderately_active_distance","very_active_distance"))
Third_list <- list(second_list , daily_steps)
Third_list<-Third_list %>% reduce(full_join, by=c("id","date","total_steps"))
Fourth_list <- list(Third_list , daily_sleep)
Fourth_list<-Fourth_list %>% reduce(full_join, by=c("id","date"))
weight_info <- list(Fourth_list , weight_info)
weight_info<-weight_info %>% reduce(full_join, by=c("id","date"))
First I will see If there any nulls or duplicated values
sum(is.na(Fourth_list)) ## TO SEE HOW NULL VALUSE WE GET
## [1] 1590
sum(duplicated(Fourth_list)) ## TO SEE HOW duplicated ROWS WE GET
## [1] 3
Fourth_list<-Fourth_list[!duplicated(Fourth_list), ] ##Delete duplicate Rows
weight_info<-weight_info[!duplicated(Fourth_list), ] ##Delete duplicate Rows
Fourth_list<-drop_na(Fourth_list) ##Delete Nulls Value
weight_info<- subset( weight_info, select = -fat ) ##For delete fat
weight_info<-drop_na(weight_info) ##Delete Nulls Value
weight_info$weekday<- weekdays(weight_info$date) ## add column
Fourth_list$weekday<- weekdays(Fourth_list$date) ## add column
reference click
so i will going to determine the total of the steps of each person in this data based in the figure above
count_steps<-Fourth_list %>%
mutate(total_steps =
case_when(
total_steps < 2700 ~ "Baseline",
total_steps >= 2700 & total_steps < 5900 ~ " -46% Mortailty Rate ",
total_steps >= 5900 & total_steps < 8500 ~ "-53% Mortailty Rate",
total_steps >= 8500 ~ "-66% Mortailty Rate"
)
)
count_steps <- count_steps %>%
count(total_steps) %>%
arrange(n)
count_steps %>%
e_charts(x = total_steps) %>%
e_bar(n, name = " Total of Mortailty Rate") %>%
e_toolbox_feature(
feature = "magicType",
type = list("line", "bar")
) %>%
e_theme("chalk") %>%
e_theme_custom('{"color":["#ff715e","#ffaf51"]}') %>%
e_tooltip(trigger = "axis")
Fourth_list$weekday<- weekdays(Fourth_list$date)
fig <- Fourth_list %>%
plot_ly(
x = ~total_steps,
y = ~calories,
color = ~calories ,
size = ~ calories,
frame = ~weekday,
hoverinfo = "text",
type = 'scatter',
mode = 'markers'
)
fig <- fig %>% layout(
xaxis = list(
type = "log"))
fig
as you can see in the above graph there’s two things :
the step rate is almost the same except some user at Saturday and Thursday have more activity than the other days
the calories is almost the same except some user at Saturday and Thursday and Tuesday have more activity than the other days
we can also see many of the user have higher steps rate but low calories !? so in this case i want to discover what are the another factors that affect on the calories .
Fourth_list |>
e_charts() |>
e_boxplot(very_active_minutes) |>
e_boxplot(lightly_active_minutes) %>%
e_boxplot(sedentary_minutes ) %>%
e_boxplot(fairly_active_minutes) %>%
e_theme_custom('{"color":["#ff715e","#ffaf51"]}')
Fourth_list%>%
e_charts(calories ) %>%
e_bar(very_active_distance) %>%
e_bar(light_active_distance) %>%
e_bar(logged_activities_distance) %>%
e_bar(moderately_active_distance) %>%
e_tooltip()%>%
e_datazoom(start = 50) %>%
e_toolbox_feature(
feature = "magicType",
type = list("line", "bar"))
sleep_per_hours<-Fourth_list %>%
mutate(total_minutes_asleep =
case_when(
total_minutes_asleep < 420 ~ "Less than 7h",
total_minutes_asleep <= 540 ~ "7h to 9h",
total_minutes_asleep > 540 ~ "More than 9h"
)
)
sleep_per_hours <- sleep_per_hours %>%
count(total_minutes_asleep) %>%
arrange(n)
sleep_per_hours%>%
e_charts(x = total_minutes_asleep) %>%
e_pie( n , radius = c("50%", "70%")) %>%
e_title("Total time of sleep ",
textAlign = "center",
left ="50%") %>%
e_labels(show = TRUE,
formatter = "{c} \n {d}%",
position = "inside") %>%
e_legend(right = 0,
orient = "vertical")
Fourth_list %>%
select(total_minutes_asleep, total_steps) %>%
mutate(sleep_level = case_when(
total_minutes_asleep < 420 ~ "Less than 7h",
total_minutes_asleep <= 540 ~ "7h to 9h",
total_minutes_asleep > 540 ~ "More than 9h")) %>%
mutate(steps_level = case_when(
total_steps < 2700 ~ "less than 2700 steps ",
total_steps >= 2700 & total_steps < 5900 ~ " between 2700 and 5900 steps ",
total_steps >= 5900 & total_steps < 8500 ~ "between 5900 and 8500 steps",
total_steps >= 8500 ~ "more than 8500 steps")) %>%
select(-c(total_minutes_asleep, total_steps)) %>%
group_by(sleep_level, steps_level) %>%
summarise(counts = n()) %>%
ggplot(aes(x = sleep_level,
y = counts,
fill = sleep_level)) +
geom_bar(stat = "identity") +
theme_classic() +
scale_fill_manual(values=c("#4285F4", "#DB4437", "#F4B400")) +
facet_wrap(~steps_level, nrow = 1) +
theme(legend.position = "none") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(strip.text = element_text(colour = 'black', size = 8)) +
theme(strip.background = element_rect(fill = "beige", color = 'black'))+
labs(
title = "Sleep vs. steps",
x = "Sleep level",
y = "Count"
)
<
weight_info %>%
select(weight_kg, total_steps) %>%
mutate(steps_level = case_when(
total_steps < 2700 ~ "less than 2700 steps ",
total_steps >= 2700 & total_steps < 5900 ~ " between 2700 and 5900 steps ",
total_steps >= 5900 & total_steps < 8500 ~ "between 5900 and 8500 steps",
total_steps >= 8500 ~ "more than 8500 steps")) %>%
mutate(weight_level = case_when(
weight_kg <60 ~ "Less 60 kg ",
weight_kg > 60 ~ "more than 60 kg")) %>%
select(-c(total_steps, weight_kg)) %>%
group_by(steps_level,weight_level) %>%
summarise(counts = n()) %>%
ggplot(aes(x = weight_level, y = counts, fill = weight_level)) +
geom_bar(stat = "identity") +
theme_classic() +
scale_fill_manual(values=c("#4285F4","#0F9D58")) +
facet_wrap(~steps_level, nrow = 1) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(legend.position = "none") +
theme(strip.text = element_text(colour = 'black', size = 8)) +
theme(strip.background = element_rect(fill = "beige", color = 'black'))+
labs(title = "User weight vs. steps rate",
x = " weight ",
y = "Count")
we can see two things
there more activity in users with more 60 Kg a then of users less of 60 kg
the user in more 60 kg has much steps rate more than 8500 steps
my final conclusion based on my analysis is :
more of users that share their information with more than 60 kg
has much steps rate than people with less than 60 kg .
my
recommendation is : targeting young people who have less than
60 kg to motivate them to participate with walking
people with less than 7h of sleep has much steps of rate .
my recommendation is : we can invite them to take a
rest and get better sleep rate
people who have lowest calories burned have lightly distance.
my recommendation is : we can make people with “lightly
distance” to decrease their “lightly distance” by put limited time
beween the take rest and the walk .
some of people has higher steps rate but low calories
my recommendation is : due to the lightly active
minutes so we need to put notification in our smart watch to
notification users about “lightly time minutes” when it repeat many
times . .